Abstract



Statistical distance measures have found wide applicability in information retrieval tasks that
typically involve high dimensional datasets. In order to reduce the storage space and ensure
efficient performance of queries, dimensionality reduction while preserving the inter-point similarity
is highly desirable. In this paper, we investigate various statistical distance measures from the
point of view of discovering low distortion embeddings into low-dimensional spaces. More
specifically, we consider the Mahalanobis distance measure, the Bhattacharyya class of divergences
and the Kullback-Leibler divergence. We present a dimensionality reduction method based on
the Johnson-Lindenstrauss Lemma for the Mahalanobis measure that achieves arbitrarily low
distortion. By using the Johnson-Lindenstrauss Lemma again, we further demonstrate that the
Bhattacharyya distance admits dimensionality reduction with arbitrarily low additive error. We
also examine the question of embeddability into metric spaces for these distance measures due to
the availability of efficient indexing schemes on metric spaces. We provide explicit constructions
of point sets under the Bhattacharyya and the Kullback-Leibler divergences whose embeddings
into any metric space incur arbitrarily large distortions. We show that the lower bound
presented for Bhattacharyya distance is nearly tight by providing an embedding that approaches
the lower bound for relatively small dimensional datasets.



1 Introduction 

The problem of embedding distance measures into normed spaces arises in applications dealing with 
huge amounts of high-dimensional data where performing point, range or nearest-neighbor (NN) 
queries in the ambient space entails enormous computational costs. The prohibitive costs arise in 
most cases due to two factors. First, calculating pairwise distances and answering proximity queries 
such as $k$-NN queries becomes costlier as the dimensionality of the data increases (a phenomenon
more commonly known by the epithet "curse of dimensionality"). Second, in many cases such
as image databases, the imposed distance measures such as the earth mover's distance [RTG00]
are inherently expensive to compute. The problem of indexing and searching is magnified if the 
distance measures do not form a metric. 



"The short version of this paper was accepted for presentation at the 20 th International Conference on Database 
and Expert Systems Applications, DEXA 2009. 
Various approaches have been proposed to solve this problem. These include obtaining easily
estimable upper/lower-bounds on the distance measures [LBS06, RTG00] and finding embeddings
which allow specific proximity queries to be efficiently solved [CL08, IT03]. These methods have
been found to be crucial for database retrieval algorithms in obtaining speedups over naive search
techniques. This is mainly due to the fact that, in many of these examples, the lower/upper-bounds
or the embeddings involve an $\ell_p$ metric for which there exist efficient algorithms for answering
proximity queries [Sam05].

The concept of a "low distortion" embedding (defined in Section 2) is very useful in this context.
A low distortion embedding ensures that notions of distance are preserved almost intact. This is
highly desirable since it gives accuracy guarantees for all proximity queries by preserving the
geometry of the original space almost exactly, in contrast to many of the methods cited above,
which are optimized for specific types of queries. The presence of several
index structures for metric spaces (especially for the Euclidean space) [Sam05] makes it natural to
try to approximate a given distance measure by a metric distance (especially Euclidean distance).
Getting an approximation of the original distance measure in terms of a metric distance also shows
how inherently "geometric" the distance measure is. In fact, if the new metric distance is $\ell_2$,
there is the added benefit of being able to drastically reduce the dimensionality of the objects in
the database while incurring a very small distortion via the Johnson-Lindenstrauss lemma [JL84].
Apart from its practical importance, such a result that quantifies the relation between the distance
measures is of theoretical interest as well, and results of this nature, especially when both the
distance measures are metrics, have gained huge interest in theoretical computer science over the
past few years [Mat02].

A more interesting, and often more difficult, situation arises in case of statistical distance
measures. They are widely used in database and pattern recognition applications and typically
involve high-dimensional data. It has been found that in many scenarios, especially in similarity
based search in image retrieval [RBD05], statistical distance measures like the Mahalanobis
[Mah36] and Bhattacharyya [Bha43] measures give better performance than the standard $\ell_2$
distance. In the field of bioinformatics, the Mahalanobis distance measure has been found to be
more useful than the $\ell_2$ distance when the distance between two DNA sequences is measured by
simultaneously comparing the frequencies of all subsequences of $n$ adjacent letters (i.e., $n$-words)
in the two sequences [WBD97]. Furthermore, the Mahalanobis distance has found application in
face recognition as well [FHVW03]. The Bhattacharyya class of distance measures, which includes
the Bhattacharyya distance and the Hellinger distance, is also widely used in diverse database
scenarios such as nearest-neighbor classification [LS99], detecting voice over IP floods [SWWJ08],
and recognition tasks. However, it should be noted that applications of the Bhattacharyya
distance often use the distance measure to calculate the dissimilarity between multivariate
normal distributions; we do not address that in this paper. Another important and widely used 
statistical similarity measure is the Kullback-Leibler divergence [Kul59]. It has been shown that the
Kullback-Leibler divergence is well suited for use in real-time image segmentation algorithms and
time-critical texture classification and retrieval from large databases [MSB02]. It has also been used
in machine learning to design pattern classifiers for high-dimensional image spaces [LS03]. Apart
from its applications this measure is interesting from a theoretical perspective as well because of 
its information-theoretic roots. 

These distance measures have received a lot of attention recently and have been examined
from several perspectives including clustering [CM08], NN-retrieval [Cay08], computing Voronoi
diagrams [NBN07], and sketching [GIM07]. We examine these distance measures from another
interesting perspective: that of low distortion embeddings into metric spaces and dimensionality
reduction. The lack of inherent "geometric" properties makes them harder candidates for low
distortion and low-dimensionality embeddings into metric spaces. To the best of our knowledge, this
is the first investigation into these distance measures from the point of view of dimensionality
reduction and embeddability into metric spaces.

Our Contributions: In this paper, we examine three statistical distance measures with the goal of 
obtaining low distortion, low-dimensional embeddings for them. The paper is structured as follows. 
In Section 2 we provide definitions and well known results that will be used in the subsequent
discussion. In Section 3 we consider the Bhattacharyya distance and prove that there cannot
exist low distortion embeddings for the Bhattacharyya distance into a metric space. We develop a
technique which shows that the lower bound on the distortion gets larger as we include distributions
that are "farther" from the uniform distribution in the probability simplex. We also provide an
embedding into the Hellinger space (which forms a metric in the positive orthant) that approaches the
lower bound for small dimensions.

In Section 4, using the technique developed in the previous section, we show a similar result for
the Kullback-Leibler divergence. We also develop another technique that gives better bounds on
the distortion when the set of distributions under consideration does not contain distributions that
are "far away" from the uniform distribution. The section ends with a result relating the
Kullback-Leibler divergence and the $\ell_1$ measure. Finally, in Section 5 we provide low distortion embeddings
for the family of Quadratic Form Distances. As a special case, we show that the Mahalanobis
metric also admits a low distortion embedding. We end the paper with some concluding remarks
in Section 6.

2 Preliminaries 

We begin by defining a few preliminary concepts related to metric spaces and embeddings. 

Definition 1 (Metric Space). A pair $M = (X, \rho)$, where $X$ is a set and $\rho : X \times X \to \mathbb{R}^+ \cup \{0\}$, is
called a metric space provided the distance measure $\rho$ satisfies the properties of identity, symmetry
and triangle inequality.

Definition 2 (D-embedding and Distortion). Given two metric spaces $(X, \rho)$ and $(Y, \sigma)$, a mapping
$f : X \to Y$ is called a $D$-embedding, where $D \ge 1$, if there exists a number $r > 0$ such that for all
$x, y \in X$,

$$r \cdot \rho(x, y) \le \sigma(f(x), f(y)) \le D r \cdot \rho(x, y)$$

The infimum of all numbers $D$ such that $f$ is a $D$-embedding is called the distortion of $f$.

In general, the factor r is intended to allow the distances to be scaled by some constant factor 
and does not affect the definition of the embedding (see [Mat02] for details) . It is easy to see that 
this notion of distortion can be naturally extended to non-metric spaces as well. 

A classic result widely used in the field of metric embeddings is the Johnson-Lindenstrauss 
Lemma [JL84] which makes it possible for large point sets in high-dimensional Euclidean spaces to 
be embedded into low-dimensional Euclidean spaces with arbitrarily small distortion. 
Theorem 1 (Johnson-Lindenstrauss Lemma [Mat02]). Let $X$ be an $n$-point set in a $d$-dimensional
Euclidean space (i.e., $(X, \ell_2) \subset (\mathbb{R}^d, \ell_2)$), and let $\epsilon \in (0, 1]$ be given. Then there exists a
$(1 + \epsilon)$-embedding of $X$ into $(\mathbb{R}^k, \ell_2)$ where $k = O(\epsilon^{-2} \log n)$. Furthermore, this embedding can be
found in randomized polynomial time.

The main idea being used here is that the length of a $d$-dimensional vector when randomly
projected onto a lower $k$-dimensional space is sharply concentrated around its expected value.
This allows us to show the existence of a low distortion embedding as well as obtain a randomized
algorithm to discover the embedding via random projections. The original proof of Johnson
and Lindenstrauss [JL84] has been greatly simplified by the algorithmic proofs of Gupta-Dasgupta
[GD02], Indyk-Motwani [IM98] and Arriaga-Vempala [VA06]. However, all these techniques involve
sampling from continuous distributions, which makes them somewhat impractical for database
purposes where one would like to perform operations via simple SQL queries. This problem was solved
by a beautiful result due to Achlioptas [Ach01], who showed that one can in fact use a projection
matrix with each entry chosen independently from the distribution $U\{-1, +1\}$. (In fact, as shown
in [VA06], any distribution that is symmetric about the origin with the second moment as unity
and bounded higher even moments can be used.) This is most suited to a database application
where the random projection can be applied by splitting the attribute set into two halves by
random sampling, summing up the attributes in each half, taking the difference of the
two sums, and repeating this as many times as the dimensionality of the projected space.
We now state the main result of Achlioptas [Ach01], which assures that our algorithmic results are
readily applicable to database situations as well.

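Lemma 1 ([Ach01]). Let $R = (r_{ij})$ be a random $d \times k$ matrix, such that each entry $r_{ij}$ is chosen
independently according to $U\{+1, -1\}$. For any fixed unit vector $u \in \mathbb{R}^d$, and any $\epsilon > 0$, let
$u' = \frac{1}{\sqrt{k}}(R^T u)$. Then, $E\left[\|u'\|^2\right] = 1 = \|u\|^2$ and

$$\Pr\left[(1 - \epsilon)\|u\|^2 \le \|u'\|^2 \le (1 + \epsilon)\|u\|^2\right] \ge 1 - 2e^{-\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)}$$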

This establishes the Johnson-Lindenstrauss lemma. For a given value of $\epsilon$, the dimensionality
of the projected space (i.e., $O(\epsilon^{-2} \log n)$) is chosen in order to ensure an inverse polynomial error
probability which facilitates the application of the union bound to ensure that none of the pairwise 
distances in the n-point set is distorted too much. The result also ensures that even the inner 
products are preserved to an arbitrarily low additive error; this is characterized by the following 
corollary. 

Corollary 1 ([VA06]). Let $u, v$ be unit vectors in $\mathbb{R}^d$. Then, for any $\epsilon > 0$, a random projection of
these vectors to yield the vectors $u'$ and $v'$ respectively satisfies

$$\Pr\left[u \cdot v - \epsilon \le u' \cdot v' \le u \cdot v + \epsilon\right] \ge 1 - 4e^{-\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)}$$

Proof. Apply Lemma 1 to the vectors $u$, $v$ and $u - v$. The result follows from using simple facts
concerning inner products. □
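
As a rough illustration of Lemma 1 and Corollary 1, the following Python sketch projects two unit vectors with a random sign matrix and compares squared norms and inner products before and after the projection. The helper names (`sign_projection_matrix`, `project`, `random_unit_vector`), the dimensions and the seeds are assumptions chosen for this example only.

```python
import math
import random


def sign_projection_matrix(d, k, seed=0):
    """Return a d x k matrix whose entries are drawn uniformly from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(d)]


def project(R, u):
    """Compute u' = (1 / sqrt(k)) * R^T u for a d-dimensional vector u."""
    d, k = len(R), len(R[0])
    scale = 1.0 / math.sqrt(k)
    return [scale * sum(R[i][j] * u[i] for i in range(d)) for j in range(k)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_unit_vector(d, rng):
    """A random non-negative vector, rescaled to unit Euclidean length."""
    w = [rng.random() for _ in range(d)]
    norm = math.sqrt(dot(w, w))
    return [x / norm for x in w]


d, k = 500, 300  # illustrative sizes, not taken from the text
rng = random.Random(1)
u, v = random_unit_vector(d, rng), random_unit_vector(d, rng)

R = sign_projection_matrix(d, k)
u_proj, v_proj = project(R, u), project(R, v)

print("squared norm before / after:", dot(u, u), dot(u_proj, u_proj))
print("inner product before / after:", dot(u, v), dot(u_proj, v_proj))
```

Because every entry of the matrix is a sign, each projected coordinate is a difference of two attribute sums, which is the property that makes the scheme easy to express with simple aggregate queries.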

We shall be using the properties of the embedding described above to obtain low distortion 
embeddings for various statistical distance measures. To facilitate the discussion we refer to the 
process of mapping high-dimensional point sets to low-dimensional ones via random projections 
as JL-type embeddings. We now develop some of the terminology that will be used later. In
the discussion below, we assume that the histograms are normalized, i.e., they correspond to
probability distributions.

Definition 3 (Representative vector). Given a $d$-dimensional histogram $P = (p_1, \ldots, p_d)$, let $\sqrt{P}$
denote the unit vector $(\sqrt{p_1}, \ldots, \sqrt{p_d})$. We shall call this the representative vector of $P$.

Definition 4 ($\alpha$-constrained histogram). A histogram $P = (p_1, p_2, \ldots, p_d)$ is said to be $\alpha$-constrained
if $p_i \ge \frac{\alpha}{d}$ for $i = 1, 2, \ldots, d$.

This ensures that the $\alpha$-constrained histograms have a level of "smoothness" to them and are
not extremely skewed. The following result holds for all $\alpha$-constrained histograms.

Observation 1. Given two $\alpha$-constrained histograms $P$ and $Q$, the inner product between the
representative vectors is at least $\alpha$, i.e., $\langle \sqrt{P}, \sqrt{Q} \rangle \ge \alpha$.

For convenience, we will denote $\frac{\alpha}{d}$ by $\beta$. A $d$-dimensional $\beta$-constrained distribution will then
imply a distribution that is $\alpha$-constrained with $\alpha = \beta \cdot d$. Since the histograms are normalized, $\alpha$
must be less than 1. In other words, $\beta < \frac{1}{d}$. We next examine three statistical distance measures
starting with the Bhattacharyya class of distance measures.
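
The definitions above translate directly into a few lines of Python. The sketch below computes the representative vector, the largest $\alpha$ for which a histogram is $\alpha$-constrained, and checks Observation 1 on one pair of histograms; the function names and the example histograms are illustrative assumptions.

```python
import math


def representative_vector(P):
    """Map a normalized histogram P to its representative vector sqrt(P)."""
    return [math.sqrt(p) for p in P]


def constraint_level(P):
    """Largest alpha for which P is alpha-constrained, i.e. d * min_i p_i."""
    return len(P) * min(P)


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


# Example histograms (hypothetical data).
P = [0.7, 0.1, 0.1, 0.1]
Q = [0.25, 0.25, 0.25, 0.25]

# Both histograms are alpha-constrained for the smaller of the two levels.
alpha = min(constraint_level(P), constraint_level(Q))
beta = alpha / len(P)
similarity = inner_product(representative_vector(P), representative_vector(Q))

print("alpha =", alpha, "beta =", beta)
print("<sqrt(P), sqrt(Q)> =", similarity)
```

Each term $\sqrt{p_i q_i}$ is at least $\frac{\alpha}{d}$, so summing over the $d$ coordinates gives the bound of Observation 1.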



3 The Bhattacharyya Class of Distance Measures 

In the field of pattern classification, more specifically Bayesian decision theory, the Bhattacharyya
bound is an upper-bound on the expected error rate of a Bayesian decision process [DHS00]. For
a binary classification task on a feature $x$ with the two categories as $\omega_1$ and $\omega_2$ and the likelihood
parameters characterized by the distributions $p(x|\omega_i)$ for $i = 1, 2$, the Bhattacharyya bound on the
error probability of a Bayesian classifier, ignoring the priors, is given by

$$\Pr\{\text{error}\} \le \int \sqrt{p(x|\omega_1)\, p(x|\omega_2)}\; dx$$



This is known as the Bhattacharyya coefficient between the two distributions. For two histograms
$P = (p_1, p_2, \ldots, p_d)$ and $Q = (q_1, q_2, \ldots, q_d)$ with $\sum_{i=1}^{d} p_i = \sum_{i=1}^{d} q_i = 1$ and each $p_i, q_i \ge 0$,
the Bhattacharyya coefficient [Bha43] is described as

$$BC(P, Q) = \sum_{i=1}^{d} \sqrt{p_i q_i}$$

Using this coefficient, two distance measures can be defined. The Bhattacharyya
distance [Bha43] is defined as

$$BD(P, Q) = -\ln BC(P, Q)$$

It is easy to see that this measure does not form a metric. Another distance measure in this class,
namely, the Hellinger distance [SWWJ08] between distributions is defined as

$$H(P, Q) = 1 - BC(P, Q) = \frac{1}{2}\left\|\sqrt{P} - \sqrt{Q}\right\|_2^2$$



The fact that $H(P, Q)$ is, up to a factor of $\frac{1}{2}$, the squared Euclidean distance between the points
$\sqrt{P}$ and $\sqrt{Q}$ allows us to state the following theorem.
Theorem 2. The Hellinger distance admits a low distortion dimensionality reduction.

Proof. Since the Hellinger distance between two histograms is half the squared Euclidean distance between
their representative vectors, given a set of histograms, if we subject the corresponding set of
representative vectors to a JL-type embedding then Lemma 1 ensures that half the squared Euclidean distance
between the embedded representative vectors is a $1 \pm \epsilon$ approximation of the Hellinger distance
between the corresponding histograms with high probability. □
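
The relation between the three quantities of this section can be checked numerically. The following sketch (function names and example histograms are our own illustrative choices) computes the Bhattacharyya coefficient, the Bhattacharyya distance and the Hellinger distance, and compares $1 - BC(P, Q)$ with half the squared Euclidean distance between the representative vectors.

```python
import math


def bhattacharyya_coefficient(P, Q):
    return sum(math.sqrt(p * q) for p, q in zip(P, Q))


def bhattacharyya_distance(P, Q):
    return -math.log(bhattacharyya_coefficient(P, Q))


def hellinger(P, Q):
    return 1.0 - bhattacharyya_coefficient(P, Q)


def half_squared_euclidean(P, Q):
    """0.5 * || sqrt(P) - sqrt(Q) ||_2^2 on the representative vectors."""
    return 0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q))


# Example histograms (hypothetical data).
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

print("BC =", bhattacharyya_coefficient(P, Q))
print("BD =", bhattacharyya_distance(P, Q))
print("H  =", hellinger(P, Q))
print("0.5 * ||sqrt(P) - sqrt(Q)||^2 =", half_squared_euclidean(P, Q))
```

The last two printed values agree because the representative vectors have unit length, so the squared distance expands to $2 - 2\,BC(P, Q)$.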

3.1 Dimensionality Reduction for the Bhattacharyya Distance 

We now consider the possibility of extending this idea to the Bhattacharyya distance as well which 
would provide us with dimensionality reduction for the Bhattacharyya distance. Given a set of 
histograms under the Bhattacharyya distance that are $\alpha$-constrained and an error parameter $\epsilon$,
we perform a JL-type embedding with the error parameter appropriately set. This embedding 
constitutes a dimensionality reduction as we impose the Bhattacharyya distance on the new space. 
The following theorem shows that this embedding only incurs a small additive error. 

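Theorem 3. A JL-type embedding of a set of $\alpha$-constrained histograms under the Bhattacharyya
distance measure with the error parameter set to $\epsilon' = \frac{\epsilon\alpha}{2}$ incurs only an additive error of $\epsilon$, i.e., if
$P, Q$ are the initial histograms transformed respectively to $P', Q'$, then

$$BD(P, Q) - \epsilon \le BD(P', Q') \le BD(P, Q) + \epsilon$$

Proof. Let $BC(P', Q')$ denote the inner product of the embedded representative vectors. By
Corollary 1 we have the following with high probability:

$$BC(P, Q) - \epsilon' \le BC(P', Q') \le BC(P, Q) + \epsilon'$$

Since the distributions are $\alpha$-constrained, Observation 1 gives $BC(P, Q) \ge \alpha$, so we have

$$BC(P, Q)\left(1 - \frac{\epsilon'}{\alpha}\right) \le BC(P', Q') \le BC(P, Q)\left(1 + \frac{\epsilon'}{\alpha}\right)$$

For any $x$, $e^x \ge 1 + x$. Hence

$$BC(P', Q') \le BC(P, Q)\, e^{\epsilon'/\alpha}$$

Also, the function $f(x) = 2x - \ln\frac{1}{1 - x}$ is non-negative for all $0 \le x \le \frac{1}{2}$. Hence, for $\frac{\epsilon'}{\alpha} \le \frac{1}{2}$ (which is true
since $\epsilon' = \frac{\epsilon\alpha}{2}$ and $\epsilon < 1$), we have

$$BC(P', Q') \ge BC(P, Q)\, e^{-2\epsilon'/\alpha}$$

Taking $-\ln()$ throughout and using the definition of the Bhattacharyya distance, this implies the
following bounds on $BD(P', Q')$

$$BD(P, Q) - \frac{\epsilon'}{\alpha} \le BD(P', Q') \le BD(P, Q) + \frac{2\epsilon'}{\alpha}$$

which gives us the desired result since $\epsilon' = \frac{\epsilon\alpha}{2}$. □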
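
The additive guarantee of Theorem 3 can be sanity-checked numerically. The sketch below takes $\epsilon' = \frac{\epsilon\alpha}{2}$ as the error parameter (our reading of the theorem statement, so treat it as an assumption), perturbs a Bhattacharyya coefficient $BC \ge \alpha$ by $\pm\epsilon'$, and reports the largest resulting change in the Bhattacharyya distance as a fraction of $\epsilon$. The function name and the grid of test values are hypothetical.

```python
import math


def bd_shift(bc, alpha, eps):
    """Largest increase and decrease of BD when BC moves by at most eps'."""
    eps_prime = eps * alpha / 2.0  # assumed setting of the error parameter
    bd = -math.log(bc)
    increase = -math.log(bc - eps_prime) - bd  # BC shrinks, so BD grows
    decrease = bd + math.log(bc + eps_prime)   # BC grows, so BD shrinks
    return increase, decrease


worst = 0.0
for alpha in (0.1, 0.5, 0.9):
    for eps in (0.05, 0.2, 0.5, 0.99):
        # BC ranges between alpha (Observation 1) and 1.
        for bc in (alpha, (alpha + 1.0) / 2.0, 1.0):
            increase, decrease = bd_shift(bc, alpha, eps)
            worst = max(worst, increase / eps, decrease / eps)

print("largest shift as a fraction of eps:", worst)
```

A value below 1 means that, on this grid, the distance never moves by more than $\epsilon$, in line with the theorem.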

The natural question that arises now is whether the Bhattacharyya distance, being a non-metric, 
also admits low distortion embeddings into metric spaces. Next, we develop a proof technique that 
shows that the distortion incurred by any embedding of point sets under the Bhattacharyya distance 
into a metric space can be made arbitrarily large by including appropriately chosen histograms. 

3.2 The Relaxed Triangle Inequality Technique 

In order to present the proof, we first define the $\lambda$-relaxed triangle inequality for a distance measure.
This essentially parallels the definition of a relaxed metric as defined in [CM08].

Definition 5 ($\lambda$-Relaxed Triangle Inequality). A set $X$ equipped with a distance function
$d : X \times X \to \mathbb{R}^+ \cup \{0\}$ is said to satisfy the $\lambda$-relaxed triangle inequality if there exists some
constant $\lambda \le 1$ such that for all triplets $p, q, r \in X$, the following holds

$$d(p, r) + d(r, q) \ge \lambda \cdot d(p, q)$$

Metrics satisfy the $\lambda$-relaxed triangle inequality for $\lambda = 1$. The following result allows us to
arrive at lower bounds on the distortion of embeddings into metric spaces for a distance measure
that violates a relaxed triangle inequality.

Lemma 2. Any embedding of a set $X$ equipped with a distance function $d$ that does not satisfy the
$\lambda$-relaxed triangle inequality into a metric space incurs a distortion of at least $\frac{1}{\lambda}$.

Proof. Since $(X, d)$ does not satisfy the $\lambda$-relaxed triangle inequality, there exist points $p, q, s \in X$
such that $d(p, s) + d(s, q) < \lambda \cdot d(p, q)$. Now, let $(X, d)$ be embedded into a metric space $(Y, \rho)$ via
the mapping $f$ that incurs a distortion $D$. This implies that, for all points $x, y \in X$, we have

$$r \cdot d(x, y) \le \rho(f(x), f(y)) \le D \cdot r \cdot d(x, y)$$

Next, consider the three points $p, q, s \in X$. Since $(Y, \rho)$ is a metric space, it satisfies the triangle
inequality for the embeddings of these points. Hence

$$\begin{aligned}
\rho(f(p), f(s)) + \rho(f(s), f(q)) &\ge \rho(f(p), f(q)) \\
rD \cdot d(p, s) + rD \cdot d(s, q) &\ge r \cdot d(p, q) \\
rD\lambda \cdot d(p, q) &> r \cdot d(p, q)
\end{aligned}$$

Hence we have $D > \frac{1}{\lambda}$. □



In general, if $(Y, \rho)$, i.e., the space being embedded into, satisfies a $\lambda'$-relaxed triangle inequality,
then we get a lower bound of $\frac{\lambda'}{\lambda}$ on the value of distortion of $(X, d)$ into $(Y, \rho)$.
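
Lemma 2 suggests a direct numerical recipe: find the triple that violates the triangle inequality most severely and read off the implied distortion bound. The following sketch does this by brute force over all ordered triples; the function name `tightest_lambda` and the toy point sets (points on a line under the squared and the absolute difference) are illustrative assumptions, not constructions from this paper.

```python
import itertools


def tightest_lambda(points, dist):
    """Smallest value of (d(p, s) + d(s, q)) / d(p, q) over ordered triples.

    The point set violates the lambda-relaxed triangle inequality for every
    lambda above the returned value, so by Lemma 2 any embedding into a
    metric space has distortion at least 1 / (returned value).
    """
    best = float("inf")
    for p, s, q in itertools.permutations(points, 3):
        d_pq = dist(p, q)
        if d_pq > 0:
            best = min(best, (dist(p, s) + dist(s, q)) / d_pq)
    return best


points = [0.0, 1.0, 2.0]

squared = tightest_lambda(points, lambda a, b: (a - b) ** 2)
absolute = tightest_lambda(points, lambda a, b: abs(a - b))

print("squared difference:", squared, "-> distortion at least", 1.0 / squared)
print("absolute difference:", absolute, "-> no bound beyond", 1.0 / absolute)
```

For the squared difference the middle point gives $1 + 1$ against $4$, hence a ratio of $\frac{1}{2}$ and a distortion of at least 2, whereas the absolute difference is a metric and yields only the trivial bound.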
3.3 Lower Bound on Distortion for Embeddings into Metric Spaces



We now appeal to the relaxed triangle inequality argument by constructing point sets under
the Bhattacharyya distance that fail to satisfy the relaxed triangle inequality and then applying
Lemma 2 to get a lower bound on the distortion.

The idea is to choose three distributions $P, Q, R$ such that the angle between the representative
vectors of $P$ and $R$ is almost $\pi/2$ (i.e., the similarity is almost 0) while the angle between the
representative vectors of $P$ and $Q$ and those of $Q$ and $R$ is much less than $\pi/2$. This ensures
that $BD(P, R)$ is much larger than $BD(P, Q)$ and $BD(Q, R)$. Our result is characterized by the
following theorem.

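Theorem 4. There exist $d$-dimensional $\beta$-constrained distributions such that any embedding of
these distributions under the Bhattacharyya distance measure into a metric space must incur a
distortion of

$$D = \begin{cases}
\Omega\left(\dfrac{\ln\frac{1}{d\beta}}{\ln d}\right) & \text{when } \beta > \frac{1}{d^2} \\[2ex]
\Omega\left(\dfrac{\ln\frac{1}{\beta}}{\ln d}\right) & \text{when } \beta \le \frac{1}{d^2}
\end{cases}$$

Proof. We know that the Bhattacharyya coefficient for two distributions is the inner product
between their corresponding vector representations on the unit sphere. Recalling the definitions,

$$BC(P, Q) = \sum_{i=1}^{d} \sqrt{p_i}\sqrt{q_i}, \qquad BD(P, Q) = -\ln BC(P, Q)$$

Consider the following $\beta$-constrained distributions which satisfy the above properties:

$$\begin{aligned}
P &= (1 - (d-1)\beta, \beta, \ldots, \beta) \\
Q &= \left(\tfrac{1}{d}, \tfrac{1}{d}, \ldots, \tfrac{1}{d}\right) \\
R &= (\beta, 1 - (d-1)\beta, \beta, \ldots, \beta)
\end{aligned}$$

Now we have

$$\begin{aligned}
BC(P, R) &= (d-2)\beta + 2\sqrt{\beta}\sqrt{1 - (d-1)\beta} \\
BC(P, Q) = BC(Q, R) &= \frac{\sqrt{1 - (d-1)\beta} + \sqrt{\beta}(d-1)}{\sqrt{d}}
\end{aligned}$$

Applying Lemma 2 we get the following bound on the distortion

$$D \ge \frac{BD(P, R)}{BD(P, Q) + BD(Q, R)} = \frac{-\ln\left((d-2)\beta + 2\sqrt{\beta}\sqrt{1 - (d-1)\beta}\right)}{2\ln\dfrac{\sqrt{d}}{\sqrt{1 - (d-1)\beta} + \sqrt{\beta}(d-1)}}$$

Since $\beta > 0$ and $1 - (d-1)\beta > \frac{1}{d}$, we have $\sqrt{1 - (d-1)\beta} + \sqrt{\beta}(d-1) > \frac{1}{\sqrt{d}}$, so the
denominator is at most $2\ln d$. In order to get a lower
bound on the numerator, we need an upper bound on $(d-2)\beta + 2\sqrt{\beta}\sqrt{1 - (d-1)\beta}$. Using the
fact that $1 - (d-1)\beta \le 1$, we get an upper bound of $d\beta + 2\sqrt{\beta}$ on the above expression. In case
$\beta > \frac{1}{d^2}$, we have $\sqrt{\beta} < d\beta$ and hence $d\beta + 2\sqrt{\beta} < 3d\beta$, which implies that

$$D = \Omega\left(\frac{\ln\frac{1}{d\beta}}{\ln d}\right)$$

Otherwise, $d\beta + 2\sqrt{\beta} \le 3\sqrt{\beta}$, which implies that

$$D = \Omega\left(\frac{\ln\frac{1}{\beta}}{\ln d}\right)$$

□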
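
The growth of the lower bound as $\beta$ shrinks can be observed numerically. The sketch below builds two histograms that put almost all mass on different coordinates and a middle histogram $Q$, and evaluates the ratio $BD(P, R) / (BD(P, Q) + BD(Q, R))$, which by Lemma 2 lower-bounds the distortion. Taking $Q$ to be the uniform histogram is an assumption of this sketch, as are the function names and the chosen values of $d$ and $\beta$.

```python
import math


def bhattacharyya_distance(P, Q):
    return -math.log(sum(math.sqrt(p * q) for p, q in zip(P, Q)))


def triangle_ratio(d, beta):
    """BD(P, R) / (BD(P, Q) + BD(Q, R)) for a three-point construction."""
    peak = 1.0 - (d - 1) * beta
    P = [peak] + [beta] * (d - 1)       # mass concentrated on coordinate 0
    R = [beta, peak] + [beta] * (d - 2)  # mass concentrated on coordinate 1
    Q = [1.0 / d] * d                    # assumption: uniform middle point
    return bhattacharyya_distance(P, R) / (
        bhattacharyya_distance(P, Q) + bhattacharyya_distance(Q, R)
    )


for beta in (1e-2, 1e-4, 1e-6):
    print("beta =", beta, "distortion at least", triangle_ratio(4, beta))
```

For fixed $d$ the ratio keeps increasing as $\beta$ decreases, which is the behaviour the theorem describes.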

In the following section, we demonstrate that this bound is tight up to an $O(d \ln d)$ factor.

3.4 A Metric Embedding for the Bhattacharyya distance

In this section, we first show that the Bhattacharyya distance is very closely related to the Hellinger 
distance measure as previously defined. Since the Hellinger distance forms a metric in the positive 
orthant, this allows us to get an upper bound on the distortion which, for a fixed dimension, almost 
matches the lower bound. 

Theorem 5. For any two $d$-dimensional $\beta$-constrained distributions $P$ and $Q$ with $\beta < \frac{1}{d}$, we
have

$$H(P, Q) \le BD(P, Q) \le \frac{1}{a}\ln\left(\frac{1}{(d-1)\beta}\right) H(P, Q)$$

where $a = \left(\sqrt{1 - (d-1)\beta} - \beta\right)^2$.

Proof. For two distributions $P, Q$, recall

$$BC(P, Q) = 1 - H(P, Q)$$

$$BD(P, Q) = -\ln\left(\sum_{i=1}^{d} \sqrt{p_i}\sqrt{q_i}\right) = -\ln\left(1 - H(P, Q)\right) = \sum_{k=1}^{\infty} \frac{H(P, Q)^k}{k}$$

To arrive at the lower bound we truncate the infinite series at the first term. For the upper
bound, we use the fact that the function $f(x) = -\ln(1 - x)$ is convex. The
Hellinger distance between any two $\beta$-constrained distributions is at most $\left(\sqrt{1 - (d-1)\beta} - \beta\right)^2$. Let
$a = \left(\sqrt{1 - (d-1)\beta} - \beta\right)^2$. Due to convexity of $f$, the line $mx$ lies above the curve $-\ln(1 - x)$ on $[0, a]$,
where $m = \frac{1}{a}\ln\frac{1}{1 - a}$. Therefore, we have

$$BD(P, Q) = -\ln\left(1 - H(P, Q)\right) \le \frac{1}{a}\ln\left(\frac{1}{1 - a}\right) H(P, Q)$$

Also, $1 - a = (d-1)\beta + 2\beta\sqrt{1 - (d-1)\beta} - \beta^2 \ge (d-1)\beta$ since $2\sqrt{1 - (d-1)\beta} - \beta > 0$. Thus we
get

$$BD(P, Q) \le \frac{1}{a}\ln\left(\frac{1}{(d-1)\beta}\right) H(P, Q)$$

This implies that the identity embedding of a point set under the Bhattacharyya distance into one
under the Hellinger distance incurs a distortion of $\frac{1}{a}\ln\frac{1}{(d-1)\beta}$. □

For constant $d$ and sufficiently small $\beta$, the lower bound presented in Section 3.3 is essentially
$\Omega\left(\ln\frac{1}{\beta}\right)$, whereas the embedding presented in this section has a distortion of $O\left(\ln\frac{1}{\beta}\right)$, which
implies that the lower bound is tight. In general it can be seen using Theorems 4 and 5 that for
sufficiently small $\beta$ the lower bound presented is tight up to a factor of $O(d \ln d)$. Further, the
result presented in Theorem 2 can be used to perform dimensionality reduction as well.
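
The sandwich between the Hellinger and Bhattacharyya distances can be tested empirically. The sketch below samples random $\beta$-constrained histograms and checks $H \le BD \le m \cdot H$, where $m$ is the slope of the chord of $-\ln(1 - x)$ through the origin and the point $x = a$. The value $a = (\sqrt{1 - (d-1)\beta} - \beta)^2$ follows our reading of the proof above and should be treated as an assumption, as should the sampling scheme, the names and the parameters.

```python
import math
import random


def bhattacharyya_coefficient(P, Q):
    return sum(math.sqrt(p * q) for p, q in zip(P, Q))


def random_constrained_histogram(d, beta, rng):
    """Random d-dimensional histogram with every entry at least beta."""
    w = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(w)
    return [beta + (1.0 - d * beta) * x / total for x in w]


def sandwich_holds(P, Q, slope, tol=1e-12):
    """Check H(P, Q) <= BD(P, Q) <= slope * H(P, Q) up to rounding."""
    bc = bhattacharyya_coefficient(P, Q)
    h, bd = 1.0 - bc, -math.log(bc)
    return h <= bd + tol and bd <= slope * h + tol


d, beta = 5, 0.01
a = (math.sqrt(1.0 - (d - 1) * beta) - beta) ** 2  # assumed cap on H
slope = math.log(1.0 / (1.0 - a)) / a              # chord slope m

rng = random.Random(0)
violations = 0
for _ in range(1000):
    P = random_constrained_histogram(d, beta, rng)
    Q = random_constrained_histogram(d, beta, rng)
    if not sandwich_holds(P, Q, slope):
        violations += 1

print("slope:", slope, "violations:", violations)
```

The slope plays the role of the distortion of the identity map from the Bhattacharyya distance to the Hellinger distance on this family of histograms.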



4 The Kullback-Leibler Divergence 

The Kullback-Leibler divergence arises in information theoretic settings where probability
distributions are evaluated in terms of their entropy or the amount of information they contain. The
Kullback-Leibler divergence measures the difference between the cross entropy and the self
entropy of two distributions. Given two histograms $P = \{p_1, p_2, \ldots, p_d\}$ and $Q = \{q_1, q_2, \ldots, q_d\}$, the
Kullback-Leibler divergence between the two distributions is defined as

$$KL(P, Q) = \sum_{i=1}^{d} p_i \ln\frac{p_i}{q_i}$$

Informally, it gives an asymptotic lower bound on the overhead incurred in terms of encoding length
if one were to encode data assuming it to be generated by a random source characterized by $Q$
when in fact the source is characterized by $P$. The Kullback-Leibler divergence is non-symmetric
and unbounded, i.e., for any given $c > 0$, one can construct histograms whose Kullback-Leibler
divergence exceeds $c$. In order to avoid these singularities, we assume that the histograms are
$\beta$-constrained.

Lemma 3. Given two $\beta$-constrained histograms $P, Q$, $0 \le KL(P, Q) \le \ln\frac{1}{\beta}$.

Proof. The lower bound follows directly from Jensen's inequality [DHS00]. For the upper bound,
since we know that $\frac{p_i}{q_i} \le \frac{1}{\beta}$ for all $i = 1, 2, \ldots, d$, we can write

$$KL(P, Q) = \sum_{i=1}^{d} p_i \ln\frac{p_i}{q_i} \le \sum_{i=1}^{d} p_i \ln\frac{1}{\beta} = \ln\frac{1}{\beta}$$

For $\beta = O(1)$, the upper bound is tight up to a constant factor. □
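
A small Python sketch makes the definition and the bounds of Lemma 3 concrete. The function name and the example histograms are illustrative; both histograms below are $\beta$-constrained with $\beta = 0.05$.

```python
import math


def kl_divergence(P, Q):
    """KL(P, Q) = sum_i p_i * ln(p_i / q_i) for strictly positive histograms."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q))


beta = 0.05
P = [0.9, 0.05, 0.05]           # skewed histogram, every entry >= beta
U = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # uniform histogram

forward = kl_divergence(P, U)
backward = kl_divergence(U, P)
upper = math.log(1.0 / beta)

print("KL(P, U) =", forward)
print("KL(U, P) =", backward)
print("ln(1 / beta) =", upper)
```

The two directions give different values, which illustrates the asymmetry discussed in the text, and both stay within the interval given by Lemma 3.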

In the following discussion, we show that low distortion embeddings into metric spaces cannot
exist for the Kullback-Leibler divergence. In order to show this, we utilize the relaxed triangle
inequality presented in Section 3.2 and explicitly construct point sets that violate the relaxed
triangle inequality. We also develop another technique that exploits the fact that the
Kullback-Leibler divergence is not symmetric. We first present this new proof technique before moving on
to prove bounds utilizing both the proof techniques.
4.1 The Asymmetry Technique

We present a general result that can be used to prove lower bounds on the embedding distortion
when we intend to embed a non-symmetric distance measure into a metric space. The idea is to
exploit the existence of two points $p, q$ for which there is a large gap between the distance from
$p$ to $q$ and the distance from $q$ to $p$. This idea is formalized in Lemma 4 using the following definition.

Definition 6 ($\gamma$-Relaxed Symmetry). A set $X$ equipped with a distance function
$d : X \times X \to \mathbb{R}^+ \cup \{0\}$ is said to satisfy $\gamma$-relaxed symmetry if there exists $\gamma \ge 0$ such that for all point pairs
$p, q \in X$, the following holds

$$|d(p, q) - d(q, p)| \le \gamma$$

Note that metrics satisfy the $\gamma$-relaxed symmetry for $\gamma = 0$.

Lemma 4. Given a set $X$ equipped with a distance function $d$ that does not satisfy the $\gamma$-relaxed
symmetry such that $d(x, y) \le M$ for all $x, y \in X$, any embedding of $X$ into a metric space incurs
a distortion of at least $1 + \frac{\gamma}{M}$.

Proof. Since $(X, d)$ does not satisfy the $\gamma$-relaxed symmetry, there exist points $p, q \in X$ such that
$|d(p, q) - d(q, p)| > \gamma$. If $(X, d)$ is embeddable via a mapping $f$ into a metric space $(Y, \rho)$ with a
distortion of $D$ then for some constant $r > 0$,

$$r \cdot d(x, y) \le \rho(f(x), f(y)) \le D \cdot r \cdot d(x, y)$$

Without loss of generality, assume that $d(p, q) > d(q, p)$. Since $(Y, \rho)$ is a metric space, it is
symmetric, which implies that $\rho(f(p), f(q)) = \rho(f(q), f(p))$. Since $d(p, q) > d(q, p) + \gamma$, we have

$$\begin{aligned}
\rho(f(p), f(q)) - r \cdot d(q, p) &> r \cdot \gamma \\
\frac{\rho(f(q), f(p))}{r \cdot d(q, p)} &> 1 + \frac{r \cdot \gamma}{r \cdot d(p, q)} \\
D \ge \frac{\rho(f(q), f(p))}{r \cdot d(q, p)} &> 1 + \frac{\gamma}{M}
\end{aligned}$$

This implies that the distortion is at least $1 + \frac{\gamma}{M}$. □

4.2 Lower Bounds on Distortion for Embeddings into Metric Spaces 

We now apply the above lemma to show that one cannot hope to obtain an almost isometric 
embedding of the Kullback-Leibler divergence into any metric space. We show the existence of two 
histograms $P$ and $Q$ such that $|KL(P, Q) - KL(Q, P)|$ is large. The result is formally stated in
the following theorem.

Theorem 6. For sufficiently large $d$ and small $\beta$, there exists a set $S$ of $d$-dimensional $\beta$-constrained
histograms and a constant $c > 0$ such that any embedding of $S$ into a metric space incurs a distortion
of at least $1 + c$.

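Proof. The idea is to choose two distributions which violate a $\gamma$-relaxed symmetry for large $\gamma$.
Consider the following two distributions where $\beta \ne \frac{1}{d}$

$$\begin{aligned}
P &= \left\{\tfrac{1}{d}, \tfrac{1}{d}, \ldots, \tfrac{1}{d}\right\} \\
Q &= \{1 - (d-1)\beta, \beta, \ldots, \beta\}
\end{aligned}$$

Define $\Delta KL(P, Q) = |KL(P, Q) - KL(Q, P)|$. Since we have

$$\begin{aligned}
KL(P, Q) &= \left(1 - \frac{1}{d}\right)\ln\frac{1}{d\beta} - \frac{1}{d}\ln\left(d(1 - (d-1)\beta)\right) \\
KL(Q, P) &= \beta(d-1)\ln(\beta d) + (1 - (d-1)\beta)\ln\left(d(1 - (d-1)\beta)\right)
\end{aligned}$$

by rearranging the terms we get

$$\Delta KL(P, Q) = \left|\left(1 - \frac{1}{d} + \beta(d-1)\right)\ln\frac{1}{d\beta} - \left(\frac{1}{d} + 1 - (d-1)\beta\right)\ln\left(d(1 - (d-1)\beta)\right)\right|$$

For sufficiently large $d$, we consider the situation when $\beta$ is large, say $\beta = \Theta\left(\frac{1}{d}\right)$. In this case
the first term is $O\left(\ln\frac{1}{d\beta}\right) = O(1)$ while the second is $\Theta(\ln d)$, so $\Delta KL(P, Q) = \Theta(\ln d)$. Since for $\beta$-constrained distributions the
maximum inter point distance is $M = \ln\frac{1}{\beta}$, we get a lower bound on the distortion by using
Lemma 4 as $1 + \Omega\left(\frac{\ln d}{\ln\frac{1}{\beta}}\right)$, which is $1 + \Omega(1)$ as $\beta = \Theta\left(\frac{1}{d}\right)$.

For small $\beta$, say $\beta = o\left(\frac{1}{d^2}\right)$, since $d > 2$ we get

$$\Delta KL(P, Q) \ge \frac{1}{2}\ln\frac{1}{\beta d} - \frac{1}{2}\ln d = \frac{1}{2}\ln\frac{1}{\beta d^2}$$

Using a similar argument as above, we get a lower bound on the distortion as

$$D = 1 + \Omega\left(\frac{\ln\frac{1}{\beta d^2}}{\ln\frac{1}{\beta}}\right) = 1 + \Omega(1)$$

Hence we conclude that any embedding of point sets which contain the points $P$ and $Q$ into a
metric space must have a distortion $D$ of at least $1 + c$ for some constant $c$. □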
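
The asymmetry argument can be followed numerically. The sketch below evaluates the gap $|KL(P, Q) - KL(Q, P)|$ for the uniform histogram $P$ and the peaked histogram $Q = \{1 - (d-1)\beta, \beta, \ldots, \beta\}$, once directly from the definition and once from the rearranged closed form, and then applies Lemma 4 with $M = \ln\frac{1}{\beta}$. The function names and the values of $d$ and $\beta$ are assumptions made for illustration.

```python
import math


def kl_divergence(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q))


def asymmetry_gap(d, beta):
    """|KL(P, Q) - KL(Q, P)| computed directly from the definition."""
    P = [1.0 / d] * d
    Q = [1.0 - (d - 1) * beta] + [beta] * (d - 1)
    return abs(kl_divergence(P, Q) - kl_divergence(Q, P))


def closed_form_gap(d, beta):
    """The same gap from the rearranged expression in the proof."""
    peak = 1.0 - (d - 1) * beta
    first = (1.0 - 1.0 / d + beta * (d - 1)) * math.log(1.0 / (d * beta))
    second = (1.0 / d + peak) * math.log(d * peak)
    return abs(first - second)


def asymmetry_lower_bound(d, beta):
    """Distortion bound 1 + gap / M from Lemma 4, with M = ln(1 / beta)."""
    return 1.0 + asymmetry_gap(d, beta) / math.log(1.0 / beta)


d, beta = 10, 0.001
print("gap (definition): ", asymmetry_gap(d, beta))
print("gap (closed form):", closed_form_gap(d, beta))
print("distortion at least", asymmetry_lower_bound(d, beta))
```

Since the gap can never exceed $M$, bounds obtained this way stay below 2, which is why the technique only yields a constant.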

The above argument shows the impossibility of obtaining almost isometric (i.e., with distortion
arbitrarily close to 1) embeddings of $\beta$-constrained histograms under the Kullback-Leibler
divergence. Since $\Delta KL(P, Q)$ can be at most $\ln\frac{1}{\beta}$ for any two $\beta$-constrained distributions,
a different choice of two points in the probability simplex can at best provide a constant factor
improvement over the above bounds using this technique. However,
an application of the relaxed triangle inequality technique shows that the situation is much worse.
We will now demonstrate that the Kullback-Leibler divergence admits point sets which violate the
$\lambda$-relaxed triangle inequality, where $\lambda$ can be made arbitrarily small. This implies (by applying
Lemma 2) that the distortion into any metric space can be made arbitrarily large.

Theorem 7. For sufficiently large d, there exist d- dimensional (3 -constrained distributions such that 
embedding these under the Kullback-Leibler divergence into a metric space must incur a distortion 

offl 

Proof. We construct three /3-constrained distributions that fail to satisfy a relaxed triangle inequal- 
ity under the Kullback-Leibler divergence. Consider the following distributions. The parameters e 
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and c will be fixed later. Let 

P 



11 I 

d' d' ' d / 
Q = (l-(d-l)e,e,...,e) 
R = (l-(d-l)e- c ,e- c ,...,e 



c \ 



where -7 > e > e c >/3. We have 



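\[
\begin{aligned}
KL(P,Q) &= \left(1-\frac{1}{d}\right)\ln\frac{1}{d\epsilon} + \frac{1}{d}\ln\frac{1}{d(1-(d-1)\epsilon)} \\
        &< \ln\frac{1}{d\epsilon} \\
KL(Q,R) &= (1-(d-1)\epsilon)\ln\frac{1-(d-1)\epsilon}{1-(d-1)e^{-c}} + (d-1)\epsilon\ln(\epsilon e^{c}) \\
        &< (d-1)\epsilon\ln\epsilon + (d-1)c\epsilon \\
KL(P,R) &= \left(1-\frac{1}{d}\right)\ln\frac{1}{de^{-c}} + \frac{1}{d}\ln\frac{1}{d(1-(d-1)e^{-c})} \\
        &= \Omega(c - \ln d) - O(1)
\end{aligned}
\]

Using the above inequalities, together with $\ln\frac{1}{d\epsilon} \le \ln\frac{1}{\epsilon}$,

\[
\lambda = \frac{KL(P,Q) + KL(Q,R)}{KL(P,R)}
        = O\left(\frac{\ln\frac{1}{\epsilon} + (d-1)\epsilon\ln\epsilon + (d-1)c\epsilon}{c - \ln d}\right)
\]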



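Hence, any point set containing these three points violates the $\lambda$-relaxed triangle inequality. Now,
using Lemma 2, the distortion for the Kullback-Leibler divergence is

\[ D \ge \frac{1}{\lambda} = \Omega\left(\frac{c - \ln d}{\ln\frac{1}{\epsilon} + d\epsilon\ln\epsilon + dc\epsilon}\right) \]

Since $\epsilon\ln\epsilon < 0$, we have

\[ D = \Omega\left(\frac{c - \ln d}{\ln\frac{1}{\epsilon} + dc\epsilon}\right) \]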



Consider the function $f(c,\epsilon) = \frac{c - \ln d}{\ln\frac{1}{\epsilon} + dc\epsilon}$. It turns out that $\frac{\partial f}{\partial c} > 0$ for all values of $c$. Hence, the
maximum is achieved at the maximum value of $c$, which is $\ln\frac{1}{\beta}$. Furthermore, we find that $\frac{\partial f}{\partial \epsilon} = 0$ at
$\epsilon = \frac{1}{dc} = \frac{1}{d\ln\frac{1}{\beta}}$. It can be confirmed that this extremum is actually a maximum. For a fixed value of $d$
we can choose $\beta$ small enough to make sure that the value of $\epsilon$ is at least $\beta$. For these values of $c$
and $\epsilon$ we get the lower bound as

\[ D = \Omega\left(\frac{\ln\frac{1}{\beta} - \ln d}{\ln\left(d\ln\frac{1}{\beta}\right) + 1}\right) \]

Thus, the result follows. □

Interpreting the Lower Bounds: The above bounds show how the Kullback-Leibler divergence
behaves near the uniform distribution and near the boundaries of the probability simplex. The
bounds indicate that near the uniform distribution, asymmetry makes the Kullback-Leibler divergence
hard to approximate by a metric, but as we move away from the uniform distribution the
hardness arises from the violation of the relaxed triangle inequality. More formally, it can be
seen that for point sets which are $\beta$-constrained for large $\beta$ (say $\beta = \Omega(\frac{1}{d})$), the lower bound using
the asymmetry argument gives a $1 + O(1)$ bound whereas the triangle inequality argument gives an
$o(1)$ bound. For smaller $\beta$ (say $\beta = o(\frac{1}{d})$) we get a better lower bound using the relaxed triangle
inequality argument. This lower bound behaves asymptotically as $\frac{\ln\frac{1}{\beta}}{\ln\ln\frac{1}{\beta}}$, which can be made arbitrarily
large. Note that the asymmetry argument gives a constant lower bound even for very small
$\beta$, but this only shows the ineffectiveness of this proof technique for small $\beta$.
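
The three-point construction from the proof of Theorem 7 can be checked numerically. The following Python sketch is illustrative rather than part of the original analysis. It assumes the parameter choice made at the end of the proof: the small coordinates of R sit at the boundary value β (that is, c = ln(1/β)) and ε = 1/(dc). The helper names `kl` and `triangle_ratio` are hypothetical. The sketch reports KL(P,R) / (KL(P,Q) + KL(Q,R)); a value above 1 means that even the ordinary triangle inequality fails for these three points.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p, q) = sum_i p_i * ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def triangle_ratio(d, beta):
    """Return KL(P, R) / (KL(P, Q) + KL(Q, R)) for the three-point construction.

    Assumed parameter choice (not the only possible one): the small
    coordinates of R equal beta, i.e. c = ln(1/beta), and eps = 1 / (d * c).
    """
    c = math.log(1.0 / beta)
    eps = 1.0 / (d * c)
    if not (1.0 / d > eps > beta):
        raise ValueError("need 1/d > eps > beta; choose a smaller beta")
    P = [1.0 / d] * d                               # uniform distribution
    Q = [1.0 - (d - 1) * eps] + [eps] * (d - 1)     # mass concentrated on coordinate 0
    R = [1.0 - (d - 1) * beta] + [beta] * (d - 1)   # even more concentrated
    return kl(P, R) / (kl(P, Q) + kl(Q, R))


if __name__ == "__main__":
    for beta in (1e-3, 1e-6, 1e-12):
        print(beta, triangle_ratio(10, beta))
```

For a fixed d the printed ratio grows as β shrinks, which is the qualitative behaviour the relaxed triangle inequality argument relies on.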

4.3 An Embedding for the Kullback-Leibler divergence 

In this section, we examine the properties of the identity embedding of point sets under the
Kullback-Leibler divergence into the $\ell_2^2$ distance measure. Once we have obtained this result, one
can again apply JL-type embeddings to achieve dimensionality reduction as well. To obtain this
bound we use a well-known inequality in information theory due to Pinsker [Top00]. Our result is
characterized by the following theorem.

Theorem 8. For any two $d$-dimensional $\beta$-constrained distributions $P$ and $Q$,

\[ \frac{\ell_2^2(P,Q)}{2} \le KL(P,Q) \le \left(\frac{1}{2\beta} + \frac{1}{3\beta^2}\right)\ell_2^2(P,Q) \]

Proof. The lower bound essentially follows from Pinsker's inequality [Top00], which states that

\[ \frac{1}{2}\ell_1^2(P,Q) \le KL(P,Q) \]

Since $\ell_1(P,Q) \ge \ell_2(P,Q)$, the lower bound follows automatically. For the upper bound, we use
Taylor's Theorem on the expression for $KL(P,Q)$:

\[ KL(P,Q) = \sum_{i=1}^{d} p_i \ln\frac{p_i}{q_i} = -\sum_{i=1}^{d} p_i \ln\left(1 - \frac{p_i - q_i}{p_i}\right) \]

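By Taylor's Theorem, there exists $\epsilon$ such that $|\epsilon| < \left|\frac{p_i - q_i}{p_i}\right|$, for which the following holds:

\[
\begin{aligned}
KL(P,Q) &= \sum_{i=1}^{d} p_i\left(\frac{p_i - q_i}{p_i} + \frac{(p_i - q_i)^2}{2p_i^2} + \frac{1}{(1-\epsilon)^3}\cdot\frac{(p_i - q_i)^3}{3p_i^3}\right) \\
        &= \sum_{i=1}^{d}\left(\frac{(p_i - q_i)^2}{2p_i} + \frac{1}{(1-\epsilon)^3}\cdot\frac{(p_i - q_i)^3}{3p_i^2}\right)
\end{aligned}
\]

Since $|\epsilon| < \left|\frac{p_i - q_i}{p_i}\right|$ and $p_i \ge \beta$, we have

\[ \frac{1}{(1-\epsilon)^3}\cdot\frac{(p_i - q_i)^2}{3p_i}\cdot\frac{(p_i - q_i)}{p_i} \le \frac{1}{3\beta^2}(p_i - q_i)^2 \]

Therefore,

\[ KL(P,Q) \le \left(\frac{1}{2\beta} + \frac{1}{3\beta^2}\right)\ell_2^2(P,Q) \]

The distortion of this identity embedding is $O\left(\frac{1}{\beta^2}\right)$. □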



Although this embedding does not give us a provably small distortion, it can still be used in
practical situations since the embedding is into $\ell_2^2$ space and it allows for low distortion
dimensionality reduction using the Johnson-Lindenstrauss Lemma.
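
To get a feel for how the Kullback-Leibler divergence compares with the squared Euclidean distance on β-constrained distributions, one can sample random pairs and inspect the ratio KL(P,Q) / ℓ_2²(P,Q). The sketch below is an illustration under stated assumptions, not the argument used in the proof: the sampler (a uniformly random simplex point shifted so that every coordinate is at least β) and all helper names are hypothetical choices. The smallest ratio should stay above 1/2, as Pinsker's inequality combined with ℓ_1 ≥ ℓ_2 implies. For the largest ratio the check uses 1/β, which follows from the standard bound of the KL divergence by the chi-square divergence; that bound is a well-known fact brought in for this check and is not a statement taken from this paper.

```python
import math
import random


def sample_beta_constrained(d, beta, rng):
    """Random point of the simplex with every coordinate >= beta (needs beta <= 1/d)."""
    if beta * d > 1:
        raise ValueError("beta must be at most 1/d")
    g = [rng.expovariate(1.0) for _ in range(d)]   # normalised exponentials: uniform on the simplex
    total = sum(g)
    # Shift and rescale so the coordinates still sum to 1 and each is >= beta.
    return [beta + (1.0 - beta * d) * gi / total for gi in g]


def kl(p, q):
    """Kullback-Leibler divergence with natural logarithms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def l2_squared(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def ratio_range(d, beta, trials=1000, seed=0):
    """Smallest and largest observed value of KL(P,Q) / l2^2(P,Q) over random pairs."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        p = sample_beta_constrained(d, beta, rng)
        q = sample_beta_constrained(d, beta, rng)
        ratios.append(kl(p, q) / l2_squared(p, q))
    return min(ratios), max(ratios)


if __name__ == "__main__":
    print(ratio_range(8, 0.01))
```

On random inputs the observed ratios typically sit far below the worst-case upper constant, which is in line with the remark that the embedding can be useful in practice even without a provably small distortion.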



5 The Class of Quadratic Form Distance Measures 

Consider the vector space $\mathbb{R}^d$. Given a $d \times d$ positive definite matrix $A$, the Quadratic Form Distance
measures (QFDs) define a distance measure over $\mathbb{R}^d$. If $x, y \in \mathbb{R}^d$, then $Q_A(x,y)$ is defined to be

\[ Q_A(x,y) = \sqrt{(x-y)^T A (x-y)} \]

These distance measures can be seen as acting on a distorted Euclidean space. When $A$ is a
diagonal matrix with positive entries, the corresponding QFDs are weighted Euclidean distance
measures. The family of quadratic form distances is actually defined for general positive
semi-definite matrices. However, for positive definite matrices, the distance measure $Q_A$ forms a metric.
We now show that every QFD can be embedded into a low-dimensional space with low distortion
in the inter-point distances.

Theorem 9. The family of quadratic form distance measures admits a low distortion JL-type
embedding.

Proof. Every quadratic form distance measure forming a metric is characterized by a square matrix
$A$ which is positive definite. Moreover, every positive definite matrix $A$ admits a
Cholesky decomposition of the form $A = L^T L$ [GL96]. Consider the transformation $x \mapsto R(Lx)$,
where $R$ is the random projection matrix involved in the Johnson-Lindenstrauss Lemma. Consider
two points $x, y \in \mathbb{R}^d$; then

\[ Q_A(x,y) = \sqrt{(x-y)^T A (x-y)} = \sqrt{(L(x-y))^T (L(x-y))} \]

which is the Euclidean distance between the points $Lx$ and $Ly$. Thus, the proposed transformation
gives us a low distortion embedding since the problem has been reduced to that in the undistorted
Euclidean space where the Johnson-Lindenstrauss Lemma is applicable. □
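
The transformation x ↦ R(Lx) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the Gaussian matrix scaled by 1/sqrt(k) is one common choice of Johnson-Lindenstrauss projection (an assumption made here), and the function names `qfd` and `qfd_jl_embed` are hypothetical. Note that `numpy.linalg.cholesky` returns a lower-triangular factor C with A = C C^T, so a factor L with A = L^T L is obtained as the transpose of C.

```python
import numpy as np


def qfd(A, x, y):
    """Quadratic form distance sqrt((x - y)^T A (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ A @ diff))


def qfd_jl_embed(A, X, k, rng):
    """Embed the rows of X into k dimensions via x -> R(Lx).

    Returns (projected, whitened): the k-dimensional images and the
    intermediate points Lx, whose Euclidean distances equal the QFD.
    """
    d = A.shape[0]
    L = np.linalg.cholesky(A).T                    # transpose so that A = L^T L
    R = rng.standard_normal((k, d)) / np.sqrt(k)   # assumed Gaussian JL matrix
    whitened = X @ L.T                             # row i is L x_i
    projected = whitened @ R.T                     # row i is R (L x_i)
    return projected, whitened


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, n = 300, 150, 10
    B = rng.standard_normal((d, d))
    A = B.T @ B / d + np.eye(d)                    # an arbitrary positive definite matrix
    X = rng.standard_normal((n, d))
    Y, W = qfd_jl_embed(A, X, k, rng)
    # QFD in the original space, exact Euclidean match after L, approximate match after R.
    print(qfd(A, X[0], X[1]), np.linalg.norm(W[0] - W[1]), np.linalg.norm(Y[0] - Y[1]))
```

Euclidean indexing structures can then be built on the k-dimensional rows of `projected`; the target dimension k trades accuracy of the preserved distances against storage.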

The previous theorem directly yields a simple algorithm that reduces the dimensionality
of the data points while preserving the distances given by the QFD. Given this formulation,
we now look at the Mahalanobis distance, which is a special case of the QFD measure where the
positive definite matrix is taken to be the inverse of the covariance matrix of a multivariate normal
probability distribution. Given the construction of low distortion embeddings for QFDs, the following result is
immediate.

Corollary 2. The Mahalanobis metric admits a randomized polynomial time low distortion JL-type
embedding.

6 Conclusions



We have investigated various statistical distance measures from the point of view of dimensionality 
reduction and embeddability into metric spaces. We examined and presented novel dimensionality 
reduction techniques for the Bhattacharyya distance, the Hellinger distance and the Mahalanobis 
distance measure using the Johnson-Lindenstrauss Lemma. 

We also examined the question of finding low distortion embeddings of the Bhattacharyya
distance and the Kullback-Leibler divergence, which are non-metric distance measures, into metric
spaces. We developed two novel techniques that can be used to prove lower bounds on the distortion
that must be incurred by any such embedding.

For the Bhattacharyya distance, we demonstrated that the lower bound presented is almost tight
by analyzing its relationship with the Hellinger distance, which forms a metric. We performed
a similar exercise for the Kullback-Leibler divergence to relate it to the $\ell_2^2$ measure. Although
the resulting bound does not match the lower bounds, it is of practical significance since it allows
for dimensionality reduction.

The question that we leave open is that of dimensionality reduction under the Kullback-Leibler 
divergence. Our preliminary investigations show that this is unlikely if one wants the resulting 
embedded objects to lie on the low-dimensional probability simplex. Also, it remains to be seen 
if the methods developed in this paper, viz., the adaptations of random projections to various 
distance measures and the proof techniques developed, find applications for other widely used 
statistical distance measures, many of which are non-metrics.